IEICE global.ieice.org Site

Keyword Search Result

[Keyword] Markov model(95hit)

41-60hit(95hit)

A Fully Consistent Hidden Semi-Markov Model-Based Speech Recognition System
Keiichiro OURA Heiga ZEN Yoshihiko NANKAKU Akinobu LEE Keiichi TOKUDA

PAPER-Speech and Hearing

Vol: E91-D No:11
Page(s): 2693-2700
In a hidden Markov model (HMM), state duration probabilities decrease exponentially with time, which fails to adequately represent the temporal structure of speech. One of the solutions to this problem is integrating state duration probability distributions explicitly into the HMM. This form is known as a hidden semi-Markov model (HSMM). However, though a number of attempts to use HSMMs in speech recognition systems have been proposed, they are not consistent because various approximations were used in both training and decoding. By avoiding these approximations using a generalized forward-backward algorithm, a context-dependent duration modeling technique and weighted finite-state transducers (WFSTs), we construct a fully consistent HSMM-based speech recognition system. In a speaker-dependent continuous speech recognition experiment, our system achieved about 9.1% relative error reduction over the corresponding HMM-based system.
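The exponential decay mentioned in the abstract above can be illustrated with a small sketch. In a standard HMM with self-loop probability a, the probability of staying in a state for exactly d frames is a^(d-1)(1-a), which always peaks at d = 1. The explicit duration model below (a Gaussian evaluated at integer durations and renormalized) and all parameter values are illustrative assumptions, not the distributions used by the authors.

```python
import math

def hmm_duration_pmf(self_loop_prob, max_d):
    """Implicit HMM state duration: P(d) = a^(d-1) * (1 - a) for d = 1..max_d."""
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_d + 1)]

def explicit_duration_pmf(mean, std, max_d):
    """Hypothetical explicit duration model: a Gaussian evaluated at
    integer durations 1..max_d and renormalized to sum to 1."""
    weights = [math.exp(-0.5 * ((d - mean) / std) ** 2) for d in range(1, max_d + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values only (assumptions, not taken from the paper).
hmm_pmf = hmm_duration_pmf(self_loop_prob=0.8, max_d=20)
hsmm_pmf = explicit_duration_pmf(mean=5.0, std=1.5, max_d=20)

# The most likely duration under each model (1-based frame count).
hmm_mode = hmm_pmf.index(max(hmm_pmf)) + 1
hsmm_mode = hsmm_pmf.index(max(hsmm_pmf)) + 1
print("HMM most likely duration:", hmm_mode)
print("Explicit-duration most likely duration:", hsmm_mode)
```

Whatever self-loop probability is chosen, the implicit HMM duration is most likely to be a single frame, whereas an explicit duration distribution can place its peak at a longer, more speech-like duration.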
HMM-Based Mask Estimation for a Speech Recognition Front-End Using Computational Auditory Scene Analysis
Ji Hun PARK Jae Sam YOON Hong Kook KIM

LETTER-Speech and Hearing

Vol: E91-D No:9
Page(s): 2360-2364
In this paper, we propose a new mask estimation method for the computational auditory scene analysis (CASA) of speech using two microphones. The proposed method is based on a hidden Markov model (HMM) in order to incorporate an observation that the mask information should be correlated over contiguous analysis frames. In other words, HMM is used to estimate the mask information represented as the interaural time difference (ITD) and the interaural level difference (ILD) of two channel signals, and the estimated mask information is finally employed in the separation of desired speech from noisy speech. To show the effectiveness of the proposed mask estimation, we then compare the performance of the proposed method with that of a Gaussian kernel-based estimation method in terms of the performance of speech recognition. As a result, the proposed HMM-based mask estimation method provided an average word error rate reduction of 61.4% when compared with the Gaussian kernel-based mask estimation method.
Random Texture Defect Detection Using 1-D Hidden Markov Models Based on Local Binary Patterns
Hadi HADIZADEH Shahriar BARADARAN SHOKOUHI

PAPER

Vol: E91-D No:7
Page(s): 1937-1945
In this paper a novel method for the purpose of random texture defect detection using a collection of 1-D HMMs is presented. The sound textural content of a sample of training texture images is first encoded by a compressed LBP histogram and then the local patterns of the input training textures are learned, in a multiscale framework, through a series of HMMs according to the LBP codes which belong to each bin of this compressed LBP histogram. The hidden states of these HMMs at different scales are used as a texture descriptor that can model the normal behavior of the local texture units inside the training images. The optimal number of these HMMs (models) is determined in an unsupervised manner as a model selection problem. Finally, at the testing stage, the local patterns of the input test image are first predicted by the trained HMMs and a prediction error is calculated for each pixel position in order to obtain a defect map at each scale. The detection results are then merged by an inter-scale post fusion method for novelty detection. The proposed method is tested with a database of grayscale ceramic tile images.
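For readers unfamiliar with local binary patterns (LBP), the following minimal sketch computes the basic 8-neighbor LBP code of one pixel. It only illustrates the general LBP idea; the neighbor ordering, the `>=` threshold convention, and the function name `lbp_code` are assumptions, and the compressed LBP histogram and multiscale HMM training described in the abstract are not reproduced here.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbor LBP code for the pixel at (r, c).

    Each neighbor contributes one bit: 1 if its gray value is >= the
    center value, else 0. Neighbors are visited clockwise starting at
    the top-left (an assumed ordering; other conventions exist).
    """
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

# Toy 3x3 grayscale patches (hypothetical data).
flat_patch = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
ramp_patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

flat_code = lbp_code(flat_patch, 1, 1)
ramp_code = lbp_code(ramp_patch, 1, 1)
print(flat_code, ramp_code)
```

A histogram of such codes over an image region is the usual LBP texture descriptor; a pixel whose code is poorly predicted by models trained on defect-free texture would be a candidate defect.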
View Invariant Human Action Recognition Based on Factorization and HMMs
Xi LI Kazuhiro FUKUI

PAPER

Vol: E91-D No:7
Page(s): 1848-1854
This paper addresses the problem of view-invariant action recognition using 2D trajectories of landmark points on the human body. It is a challenging task because, for a specific action category, the 2D observations of different instances might be extremely different due to varying viewpoints and changes in speed. By assuming that the execution of an action can be approximated by a dynamic linear combination of a set of basis shapes, a novel view-invariant human action recognition method is proposed based on non-rigid matrix factorization and hidden Markov models (HMMs). We show that the low-dimensional weight coefficients of the basis shapes, obtained by non-rigid factorization of the measurement matrix, contain the key information for action recognition regardless of viewpoint changes. Based on the extracted discriminative features, HMMs are used for temporal dynamic modeling and robust action classification. The proposed method is tested using real-life sequences, and promising performance is achieved.
Performance Analysis of IEEE 802.11 DCF and IEEE 802.11e EDCA in Non-saturation Condition
Tae Ok KIM Kyung Jae KIM Bong Dae CHOI

PAPER-Terrestrial Radio Communications

Vol: E91-B No:4
Page(s): 1122-1131
We analyze the MAC performance of the IEEE 802.11 DCF and 802.11e EDCA in the non-saturation condition, where a device does not always have packets to transmit. We assume that a new flow is not generated while the previous flow is in service and that the number of packets in a flow is geometrically distributed. We take into account a feature of the non-saturation condition in the standards: the first packet arriving at an idle station may be transmitted without a preceding backoff procedure. Our approach is to model the stochastic behavior of one station as a discrete-time Markov chain. We obtain four performance measures: normalized channel throughput, average packet head-of-line (HoL) delay, expected time to complete transmission of a flow, and packet loss probability. Our results can be used for admission control to find the optimal number of stations under constraints on these measures.
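Performance measures of this kind are typically derived from the stationary distribution of the Markov chain. The sketch below shows how a stationary distribution can be computed by repeated multiplication for a toy two-state chain. The states (idle, busy) and the transition probabilities are hypothetical; the authors' chain, which models backoff behavior, is far larger and is not reproduced here.

```python
def stationary_distribution(P, iters=10000, tol=1e-12):
    """Approximate pi with pi = pi * P by power iteration.

    P is a row-stochastic matrix given as a list of lists. Iteration
    stops once successive estimates differ by less than tol.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical two-state chain: state 0 = idle, state 1 = busy.
P = [[0.7, 0.3],
     [0.4, 0.6]]
pi = stationary_distribution(P)
print("long-run fraction of slots idle/busy:", pi)
```

For this toy chain the balance equation 0.3·π_idle = 0.4·π_busy gives π = (4/7, 3/7), which the iteration reproduces.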
Joint Blind Super-Resolution and Shadow Removing
Jianping QIAO Ju LIU Yen-Wei CHEN

PAPER-Image Processing and Video Processing

Vol: E90-D No:12
Page(s): 2060-2069
Most learning-based super-resolution methods neglect the illumination problem. In this paper we propose a novel method to combine blind single-frame super-resolution and shadow removal into a single operation. Firstly, from the pattern recognition viewpoint, blur identification is considered as a classification problem. We describe three methods which are respectively based on Vector Quantization (VQ), Hidden Markov Model (HMM) and Support Vector Machines (SVM) to identify the blur parameter of the acquisition system from the compressed/uncompressed low-resolution image. Secondly, after blur identification, a super-resolution image is reconstructed by a learning-based method. In this method, Logarithmic-wavelet transform is defined for illumination-free feature extraction. Then an initial estimation is obtained based on the assumption that small patches in low-resolution space and patches in high-resolution space share a similar local manifold structure. The unknown high-resolution image is reconstructed by projecting the intermediate result into general reconstruction constraints. The proposed method simultaneously achieves blind single-frame super-resolution and image enhancement especially shadow removal. Experimental results demonstrate the effectiveness and robustness of our method.
A Style Control Technique for HMM-Based Expressive Speech Synthesis
Takashi NOSE Junichi YAMAGISHI Takashi MASUKO Takao KOBAYASHI

PAPER-Speech and Hearing

Vol: E90-D No:9
Page(s): 1406-1413
This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles of speech are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled by using the MRHSMM, in which mean parameters of the state output and duration distributions are expressed by multiple-regression of the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified by transforming an arbitrarily given style vector that corresponds to a point in a low-dimensional space, called style space, each of whose coordinates represents a certain specific speaking style or emotion of speech. The results of subjective evaluation tests show that style and its intensity can be controlled by changing the style vector.
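The multiple-regression idea in the abstract above can be sketched as follows: a distribution's mean vector is computed as a linear function of the style vector. The sketch assumes the common formulation in which the style vector is augmented with a constant 1 so that the first column of the regression matrix acts as a bias; the matrix `H`, its values, and the function name `regress_mean` are hypothetical.

```python
def regress_mean(H, style_vector):
    """Mean vector as a multiple regression of the style vector.

    Assumes mu = H * [1, v_1, ..., v_L]^T, where the leading 1 makes
    the first column of H a bias (style-independent) term.
    """
    xi = [1.0] + list(style_vector)
    return [sum(h * x for h, x in zip(row, xi)) for row in H]

# Hypothetical 2-dimensional mean with a 2-dimensional style space.
H = [[1.0, 0.5, -0.2],
     [0.0, 0.3, 0.4]]

neutral_mean = regress_mean(H, [0.0, 0.0])   # origin of the style space
styled_mean = regress_mean(H, [1.0, 0.0])    # one unit along the first style axis
stronger_mean = regress_mean(H, [2.0, 0.0])  # same style, doubled intensity
print(neutral_mean, styled_mean, stronger_mean)
```

Scaling the style vector moves the mean further along the same direction, which is one way to read the abstract's claim that style intensity can be controlled by changing the style vector.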
Dynamic Bayesian Network Inversion for Robust Speech Recognition
Lei XIE Hongwu YANG

LETTER-Speech and Hearing

Vol: E90-D No:7
Page(s): 1117-1120
This paper presents an inversion algorithm for dynamic Bayesian networks towards robust speech recognition, namely DBNI, which is a generalization of hidden Markov model inversion (HMMI). As a dual procedure of expectation maximization (EM)-based model reestimation, DBNI finds the 'uncontaminated' speech by moving the input noisy speech to the Gaussian means under the maximum likelihood (ML) sense given the DBN models trained on clean speech. This algorithm can provide both the expressive advantage from DBN and the noise-removal feature from model inversion. Experiments on the Aurora 2.0 database show that the hidden feature model (a typical DBN for speech recognition) with the DBNI algorithm achieves superior performance in terms of word error rate reduction.
A Hidden Semi-Markov Model-Based Speech Synthesis System
Heiga ZEN Keiichi TOKUDA Takashi MASUKO Takao KOBAYASHI Tadashi KITAMURA

PAPER-Speech and Hearing

Vol: E90-D No:5
Page(s): 825-834
A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines a speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of synthesized speech.
State Duration Modeling for HMM-Based Speech Synthesis
Heiga ZEN Takashi MASUKO Keiichi TOKUDA Takayoshi YOSHIMURA Takao KOBAYASHI Tadashi KITAMURA

LETTER-Speech and Hearing

Vol: E90-D No:3
Page(s): 692-693
This paper describes the explicit modeling of a state duration's probability density function in HMM-based speech synthesis. We redefine, in a statistically correct manner, the probability of staying in a state for a time interval used to obtain the state duration PDF and demonstrate improvements in the duration of synthesized speech.
A Systolic FPGA Architecture of Two-Level Dynamic Programming for Connected Speech Recognition
Yong KIM Hong JEONG

PAPER-Speech and Hearing

Vol: E90-D No:2
Page(s): 562-568
In this paper, we present an efficient architecture for connected word recognition that can be implemented on a field-programmable gate array (FPGA). The architecture consists of a newly derived two-level dynamic programming (TLDP) scheme that uses only bit addition and shift operations. The advantages of this architecture are its spatial efficiency, which accommodates more words in limited space, and the absence of multiplications, which increases computational speed by reducing propagation delays. The architecture is highly regular, consisting of identical and simple processing elements with only nearest-neighbor communication, and external communication occurs only at the end processing elements. To verify the proposed architecture, we have also designed and implemented it, prototyping with Xilinx FPGAs running at 33 MHz.
Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training
Junichi YAMAGISHI Takao KOBAYASHI

PAPER-Speech and Hearing

Vol: E90-D No:2
Page(s): 533-543
In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.
A Hybrid HMM/Kalman Filter for Tracking Hip Angle in Gait Cycle
Liang DONG Jiankang WU Xiaoming BAO

LETTER-Biological Engineering

Vol: E89-D No:7
Page(s): 2319-2323
Movement of the thighs is an important factor in studying the gait cycle. In this paper, a hybrid hidden Markov model (HMM)/Kalman filter (KF) scheme is proposed to track the hip angle during gait cycles. Within this framework, the HMM and KF work in parallel to estimate the hip angle and detect major gait events. This approach has been applied to study the gait features of different subjects and compared with a video-based approach. Experimental results indicate that (1) the swing angle of the hip can be detected with a simple hardware configuration using biaxial accelerometers and (2) the hip angle can be tracked for different subjects within an error range of -5° to +5°.
HHMM Based Recognition of Human Activity
Daiki KAWANAKA Takayuki OKATANI Koichiro DEGUCHI

PAPER-Face, Gesture, and Action Recognition

Vol: E89-D No:7
Page(s): 2180-2185
In this paper, we present a method for recognition of human activity as a series of actions from an image sequence. The difficulty with the problem is that there is a chicken-egg dilemma that each action needs to be extracted in advance for its recognition but the precise extraction is only possible after the action is correctly identified. In order to solve this dilemma, we use as many models as actions of our interest, and test each model against a given sequence to find a matched model for each action occurring in the sequence. For each action, a model is designed so as to represent any activity containing the action. The hierarchical hidden Markov model (HHMM) is employed to represent the models, in which each model is composed of a submodel of the target action and submodels which can represent any action, and they are connected appropriately. Several experimental results are shown.
A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features
Makoto TACHIBANA Junichi YAMAGISHI Takashi MASUKO Takao KOBAYASHI

PAPER-Speech Synthesis

Vol: E89-D No:3
Page(s): 1092-1099
This paper proposes a technique for synthesizing speech with a desired speaking style and/or emotional expression, based on model adaptation in an HMM-based speech synthesis framework. Speaking styles and emotional expressions are characterized by many segmental and suprasegmental features in both spectral and prosodic features. Therefore, it is essential to take account of these features in the model adaptation. The proposed technique, called style adaptation, deals with this issue. First, the maximum likelihood linear regression (MLLR) algorithm, based on the hidden semi-Markov model (HSMM) framework, is presented to provide a mathematically rigorous and robust adaptation of state duration and to adapt both the spectral and prosodic features. Then, a novel tying method for the regression matrices of the MLLR algorithm is presented to allow the incorporation of both the segmental and suprasegmental speech features into the style adaptation. The proposed tying method uses regression class trees with contextual information. From the results of several subjective tests, we show that these techniques can perform style adaptation while maintaining the naturalness of the synthetic speech.
Training Augmented Models Using SVMs
Mark J.F. GALES Martin I. LAYTON

INVITED PAPER

Vol: E89-D No:3
Page(s): 892-899
There has been significant interest in developing new forms of acoustic model, in particular models that allow more dependencies to be represented than those contained within a standard hidden Markov model (HMM). This paper discusses one such class of models, augmented statistical models. Here, a local exponential approximation is made about some point on a base model. This allows more dependencies within the data to be modelled than are represented in the base distribution. Augmented models based on Gaussian mixture models (GMMs) and HMMs are briefly described. These augmented models are then related to generative kernels, one approach used for allowing support vector machines (SVMs) to be applied to variable-length data. The training of augmented statistical models within an SVM generative-kernel framework is then discussed. This may be viewed as using maximum margin training to estimate statistical models. Augmented Gaussian mixture models are then evaluated using rescoring on a large vocabulary speech recognition task.
What HMMs Can Do
Jeff A. BILMES

INVITED PAPER

Vol: E89-D No:3
Page(s): 869-891
Since their inception almost fifty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems--today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial article analyzes HMMs by exploring a definition of HMMs in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in search of a model to supersede the HMM (say for ASR), rather than trying to correct for HMM limitations in the general case, new models should be found based on their potential for better parsimony, computational requirements, and noise insensitivity.
Genetic Algorithm Based Optimization of Partly-Hidden Markov Model Structure Using Discriminative Criterion
Tetsuji OGAWA Tetsunori KOBAYASHI

PAPER-Speech Recognition

Vol: E89-D No:3
Page(s): 939-945
Discriminative modeling is applied to optimize the structure of a Partly-Hidden Markov Model (PHMM). The PHMM was proposed in our previous work to deal with the complicated temporal changes of acoustic features. It can represent observation-dependent behaviors in both observations and state transitions. In the formulation of the previous PHMM, we used a common structure for all models. However, the optimal structure, which gives the best performance, is expected to differ from category to category. In this paper, we design a new structure optimization method in which the dependence of the states and the observations of the PHMM is optimally defined for each model using the weighted likelihood-ratio maximization (WLRM) criterion. The WLRM criterion gives high discriminability between the correct category and the incorrect categories and therefore yields model structures with good discriminative performance. We define the optimal structures as the structure combination that best satisfies the WLRM criterion among all possible structure combinations. A genetic algorithm is also applied as an adequate approximation of a full search. Results of continuous lecture talk speech recognition show the effectiveness of the proposed structure optimization: it reduced word errors compared with the HMM and with the PHMM using a common structure for all models.
Human Walking Motion Synthesis with Desired Pace and Stride Length Based on HSMM
Naotake NIWASE Junichi YAMAGISHI Takao KOBAYASHI

PAPER

Vol: E88-D No:11
Page(s): 2492-2499
This paper presents a new technique for automatically synthesizing human walking motion. In the technique, a set of fundamental motion units called motion primitives is defined, and each primitive is modeled statistically from motion capture data using a hidden semi-Markov model (HSMM), which is a hidden Markov model (HMM) with explicit state duration probability distributions. The mean parameter of each HSMM probability distribution function is assumed to be given by a function of factors that control the walking pace and stride length, and a training algorithm, called factor adaptive training, is derived based on the EM algorithm. A parameter generation algorithm from motion primitive HSMMs with given control factors is also described. Experimental results for generating walking motion are presented for varying walking pace and stride length. The results show that the proposed technique can generate smooth and realistic motion that is not included in the motion capture data, without the need for smoothing or interpolation.
Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing
Makoto TACHIBANA Junichi YAMAGISHI Takashi MASUKO Takao KOBAYASHI

PAPER

Vol: E88-D No:11
Page(s): 2484-2491
This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style from representative ones, we synthesize speech from a model obtained by interpolating representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough in read speech and synthesized speech from models obtained by interpolating models for all combinations of two styles. The results show that speech synthesized from the interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in synthesized speech by changing the interpolation ratio in interpolation between neutral and other representative styles. We also show that we can achieve style morphing in speech synthesis, namely, changing style smoothly from one representative style to another by gradually changing the interpolation ratio.
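The role of the interpolation ratio can be sketched with a simplified example that linearly interpolates the mean vectors of two style models. This is only an illustration under stated assumptions: the function name `interpolate_means`, the mean values, and the restriction to means are hypothetical, and the model interpolation technique used by the authors may treat covariances and durations differently.

```python
def interpolate_means(mean_a, mean_b, ratio):
    """Linear interpolation between two mean vectors.

    ratio = 0.0 returns mean_a, ratio = 1.0 returns mean_b, and
    values in between give an intermediate style.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(mean_a, mean_b)]

# Hypothetical mean vectors of one state in two style models.
neutral = [1.0, 4.0]
joyful = [3.0, 8.0]

halfway = interpolate_means(neutral, joyful, 0.5)

# Style morphing: step the ratio gradually from 0 to 1 over an utterance.
morph_path = [interpolate_means(neutral, joyful, k / 4.0) for k in range(5)]
print(halfway, morph_path)
```

Interpolating between the neutral model and another style corresponds to controlling the degree of expressivity, while stepping the ratio over time corresponds to the style morphing described in the abstract.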
